sam 2
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definition, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed for lightweightness and efficiency, while also demonstrates strong performance across a diverse range of region understanding tasks.
Robust Ego-Exo Correspondence with Long-Term Memory
Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE).
FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection, Segmentation, and Counting
Liu, Yi, Song, Jingyu, Kallakuri, Vedanth, Skinner, Katherine A.
Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR) with a complementary scalable paradigm leveraging sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvement generalizes well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision. The project page for FishDetector-R1 is https://umfieldrobotics.github.io/FishDetector-R1.
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Mei, Haiyang, Huang, Qiming, Ci, Hai, Shou, Mike Zheng
Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.
SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation
Roddan, Alfie, Czempiel, Tobias, Xu, Chi, Elson, Daniel S., Giannarou, Stamatia
Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA's effectiveness in few-shot and zero-shot learning scenarios and using minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
Zhang, Zhixiong, Ding, Shuangrui, Dong, Xiaoyi, He, Songxin, Lin, Jianfan, Tang, Junsong, Zang, Yuhang, Cao, Yuhang, Lin, Dahua, Wang, Jiaqi
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2
Adamyan, Alen, ฤรญลพek, Tomรกลก, Straka, Matej, Janouskova, Klara, Schmid, Martin
Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks and has become the state-of-the-art for visual object tracking. The model stores information from previous frames in a memory bank, enabling temporal consistency across video sequences. Recent methods augment SAM 2 with hand-crafted update rules to better handle distractors, occlusions, and object motion. We propose a fundamentally different approach using reinforcement learning for optimizing memory updates in SAM 2 by framing memory control as a sequential decision-making problem. In an overfitting setup with a separate agent per video, our method achieves a relative improvement over SAM 2 that exceeds by more than three times the gains of existing heuristics. These results reveal the untapped potential of the memory bank and highlight reinforcement learning as a powerful alternative to hand-crafted update rules for memory control in visual object tracking.
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
Shen, Li-Cheng, Hsieh, Jih-Kang, Li, Wei-Hua, Chen, Chu-Song
Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). We leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline at first, enabling effective and scalable online retrieval. Secondly, MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
Prompt-responsive Object Retrieval with Memory-augmented Student-Teacher Learning
-- Building models responsive to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems, such as targeted manipulation amidst clutter . In this work, we present a novel approach to combine promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-responsive manner . Existing methods struggle to link high-level commands with fine-grained dexterous control. We address this gap with a memory-augmented student-teacher learning framework. We use the Segment-Anything 2 (SAM 2) model as a perception backbone to infer an object of interest from user prompts. While detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models. Our approach successfully learns prompt-responsive policies, demonstrated in picking objects from cluttered scenes.
Putting the Segment Anything Model to the Test with 3D Knee MRI - A Comparison with State-of-the-Art Performance
Mills, Oliver, Conaghan, Philip, Ravikumar, Nishant, Relton, Samuel
Menisci are cartilaginous tissue found within the knee that contribute to joint lubrication and weight dispersal. Damage to menisci can lead to onset and progression of knee osteoarthritis (OA), a condition that is a leading cause of disability, and for which there are few effective therapies. Accurate automated segmentation of menisci would allow for earlier detection and treatment of meniscal abnormalities, as well as shedding more light on the role the menisci play in OA pathogenesis. Focus in this area has mainly used variants of convolutional networks, but there has been no attempt to utilise recent large vision transformer segmentation models. The Segment Anything Model (SAM) is a so-called foundation segmentation model, which has been found useful across a range of different tasks due to the large volume of data used for training the model. In this study, SAM was adapted to perform fully-automated segmentation of menisci from 3D knee magnetic resonance images. A 3D U-Net was also trained as a baseline. It was found that, when fine-tuning only the decoder, SAM was unable to compete with 3D U-Net, achieving a Dice score of $0.81\pm0.03$, compared to $0.87\pm0.03$, on a held-out test set. When fine-tuning SAM end-to-end, a Dice score of $0.87\pm0.03$ was achieved. The performance of both the end-to-end trained SAM configuration and the 3D U-Net were comparable to the winning Dice score ($0.88\pm0.03$) in the IWOAI Knee MRI Segmentation Challenge 2019. Performance in terms of the Hausdorff Distance showed that both configurations of SAM were inferior to 3D U-Net in matching the meniscus morphology. Results demonstrated that, despite its generalisability, SAM was unable to outperform a basic 3D U-Net in meniscus segmentation, and may not be suitable for similar 3D medical image segmentation tasks also involving fine anatomical structures with low contrast and poorly-defined boundaries.